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ABSTRACT 

Markov Chain Monte Carlo (MCMC) proves to be powerful for Bayesian inference 
and in particular for exoplanet radial velocity fitting because MCMC provides more 
statistical information and makes better use of data than common approaches like chi- 
square fitting. However, the non-linear density functions encountered in these problems 
can make MCMC time-consuming. In this paper, we apply an ensemble sampler re- 
specting affine invariance to orbital parameter extraction from radial velocity data. This 
new sampler has only one free parameter, and it does not require much tuning for good 
performance, which is important for automatization. The autocorrelation time of this 
sampler is approximately the same for all parameters and far smaller than Metropolis- 
Hastings, which means it requires many fewer function calls to produce the same number 
of independent samples. The affine-invariant sampler speeds up MCMC by hundreds of 
times compared with Metropolis-Hastings in the same computing situation. This novel 
sampler would be ideal for projects involving large datasets such as statistical investiga- 
tions of planet distribution. The biggest obstacle to ensemble samplers is the existence 
of multiple local optima; we present a clustering technique to deal with local optima by 
clustering based on the likelihood of the walkers in the ensemble. We demonstrate the 
effectiveness of the sampler on real radial velocity data. 

Subject headings: methods: data analysis — methods: numerical — methods: statis- 
tical — planetary systems — stars: individual (HIP36616, HIP88048) — techniques: 
radial velocities 
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Introduction 



Markov Chain Monte Carlo has pro ven very helpful in astrophysics for searching, optimizing, 

In most cases, astrophysicists use 



and sampling probabilit y distributions (IBrooks etal 



2011 



the Metropolis-Hastings (jGilks etaLlll996l ) algorithm with a carefully tuned proposal distribution. 
Those who have departed from Metropolis-Hastings often have done so not to improv e speed but to 



improve the proper t ies of the sampler fo r computing the Bayesian evidence integral (IGregory etal 



1992 



Loredo 



1999 



Wandelt et oZ] 12004 ). 



In particular, MCMC has proven to be powerful for identification of exoplanet signals in radial 
veloci ty (RV) data , and to quanti f y the reliability a nd accuracy of exoplanet parameter inferences 



(e. g., iFord 1 12003 : iDriscoll 



2006 



Ford et al\ 120061 ) ■ MCMC produces sam ples of the posterior 



distr i bution of the orb ital parameters of each exoplanet around its target star ([Gregory 
20061: 



2005 



Ford 



Ford aZ.1 120061 ) . In turn, the sampling permits trivial (approximate) posterior probability 



calculations and parameter marginalization, to obtain the distribution of individual parameter or 
the joint distribution of several parameters. The standard Metropolis or Gibbs sampling approaches 
have several practical difficulties. One is that they have problem-specific computational parameters, 
and possibly a large numbe r of them, especially those highly correlated, that require hand tuning 
(|Ford 1120061 ; IGregory Il201ll ). Even with optimally tuned diagonal parameters, these methods can 
be very slow on some datasets. And of course, generically, there are local optima in the likelihood 
function that correspond to incorrect companion identifications, see Figure [TJ 



In this paper, we apply a new MCMC method (jGilks e£alJll994 ; iGoodman e£ al.l l2010i ) — an 
ensemble sampler — to radial-velocity analysis. This ensemble sampler has the property of being 
invariant under affine transformations (see below for precise definitions). This means that the 
new sampler automatically, and without hand tuning, works in the optimal linear rescaling of the 
problem. This results in much smaller autocorrelation times, which means that one gets the same 
amount of information about the posterior with fewer evaluations of the likelihood function. Our 
code has two other features that proved indispensable in analyzing exoplanet RV data. The first is a 
simulated annealing cooling schedule that we use to generate the initial sample. The second feature 
is a simple clustering method that identifies and removes samples from irrelevant local optima. 



Model 



Our goal is to determine what configurations of planets around a target star are consistent 
with given radial velocity measurements. The physical model consists of a collection o f Keplerian 



orbit s . The orb i t of each planet is para meterized by five orbital parameters (jFord 



2006 



Gregory 



2005 



2006 



Ford etal 



Ohta etal\ 120051 ) . This is consistent with the approximation that on the 



observational time scale of a few years, interactions between the orbits are negligible otherwise the 
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orbits would not be stable. This leads to 

Vrad(t) = Vq + ^ Av p (t) > (!) 
V 

where vq is a constant velocity offset, p labels the planet and Av p (t) is the perturbation contributed 
by planet p. The perturbation is given by 

Av p (t) = A p [sm(f p + Wp) + e p sin -cup], (2) 

where A is the velocity amplitude, w is the longitude of periastron, e is eccentricity and / is the 
true anomaly, which depends on time. The amplitude A can be given by 

(3) 



m s + m p y/\ - e 2 ' 

where m p is the mass of the planet, m s is the mass of the star, to is the mean angular velocity (2ir/P, 
where P is the period), a is the distance between the star and the planet and i is the inclination 
between normal direction of the orbit and the line of sight. The true anomaly / is related to the 
eccentric anomaly E through 

cos E - e 

cos/ = - 4 

1 — e cos E 

and E is related to the mean anomaly M through 

M = E-e$nxE (5) 

where M = u)t-\- (ft, (ft is the phase of pericenter passage. Putting the equations above together, we 
get 

Av p (t) = ^4 p [sin (tft(oj p t + <ft p , e p ) + w p ) + e p sin w p ], (6) 

where tft is a function determined by and ©. 

There are 5 independent parameters for each planet in ([6]), which are A p , u p , (ft p , e p and w p . 
We use an equivalent set of parameters, oj, A c = \^Acos(ft, A s = yAsincft, e c = yfecosw, and 
e s = \/esinw. Although our MCMC method is not effected by linear changes of variables, this 
nonlinear change of variables did improve its performance. We believe this is because it bounds (ft 
and w between and 2ir periodically, thus making full use of machine precision. Another possible 
change of parameters is replacing (ft in A c and A s with (ft + zu, leaving the others (cj, e c , e s ) 
unchanged. This re-parametrization makes the joint phase uncertainty in (ft and w less nonlinear, 
because (ft and w are linearly correlated, see Figure [3 However, any nonlinear change of variables 
in parameter space introduces a Jacobian factor in the prior and likelihood. Replacing (ft in A c 
and A s with (ft + w causes a complicated Jacobian factor, while for u, A c , A s , e c and e s , which is 
the re-parametrization we finally used, the Jacobian factor is just 1. For n planets, there are 5n 
parameters. Our model has two additional parameters. One is the reference velocity offset vq. The 
other is the jitter, s, described in the following paragraph. (If there are multiple observatories or 
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instruments, there will be multiple velocity offsets and jitters, which can be easily included in the 
code.) Together, these form the M = 5n + 2 components of our parameter vector, 8 6 M M . 

The data for an estimation consists of N observations of the radial velocity of a given star. 
For observation i, t{ (MJD in d) is the observation time, vi (ms _1 ) is the observed radial velocity, 
and <7j (ms _1 ) is the reported standard deviation of the observation error. This is the observer's 
estimate of the measurement noise, induced by the instrument. It does not include astrophysical 
noise such as stellar oscillation. We denote the data collectively as D = (ti, v%, <ti, . . . , tjy, vn, cjv)- 
Our data model assumes that observed velocities are independent and Gaussian with mean v ra d{ti) 
and variance af + s 2 . The parameter s is the jitter referred to above which consists of all the noises 
not included in the measurement noise estimation a. Jitter increases the posterior uncer tainties of 



the p arameters, so the parameter estimation is less optimistic and thus more conservative (j Gregory 



2005). The probability density to observe data D, given parameters 8, therefore is given by the 
Gaussian likelihood function 



p(D\9) = (2n) N / 2 



N 



1/2 



cxp 



N , 

E- 

i=l 



V ra d{ti)f 



2 af + S s 



(7) 



The parameters 6 enter into this formula through the dependence of v m d on the parameters. It is 
helpful to write this in terms of the negative log of the denormalized likelihood function 



N 1 

l(D\0) = Eo 



(Vj ~ Vrad{ti)f 
2(af + S 2 ) 



+ log K 2 + s 2 )) 



(8) 



And the likelihood function is p(D\9) oc e 



-1{D\6) 



We assume that the parameter vector 6 has a prior distribution in which the individual param- 
eters are independent random variables whose distribution is given in Table [TJ For demonstration 
purposes, we used a rather conventional non-informative prior. Planet identifications would be 
more reliable with a prior that better reflects our present understanding and expectations about 
the statistics of exoplanets. We will leave that as a topic for future work. The prior density, p(9), 
is the product of the appropriate densities over the M = 5n + 2 components of 8. The posterior 
density of 8 given the data is 

p(9)p(D\9) 



p{9\D) 

In the present work, the normalization constant 



p(D) 



p(D)= p(8)p(D\8)d8 



(9) 



is unknown and not used. 
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3. Sampling 



AfRne Invariance We use an ensemble sampling method that respects affine invariance. An 



affine transformation is an invertible mapping 



with the form ip = S9 + b where S is a 



non-singular matrix. Many Monte Carlo methods use diagonal or off diagonal scaling matrices S 
to improve performance. That is irrelevant for affine invariant methods. If random variable 9 has 
probability density ir(9), then ip = S9 + b has density 



ns,b(ip) = ^s,b(S9 + b) oc 7r(0). 

A sampler is affine invariant if the MCMC transition probability density P{9 
transforms in the same way: 

P$ | $ = C s ,bP{9' | 9) , 



(10) 



P{9' 



if ^ = S9' + b and ip = S9 + b, where C$b is a normalizing determinant independent of 9 and 
0'. In our codes, this is a consequence of the fact that if the code starting with 9\ would produce 
02, @3, ■ ■ ■, then starting with ipi, it would produce ip2, ip3, ■ ■ ■> always with tpj = S9j + b. 

Co mmon MCMC samplers re quire certain customization to sample ill-shaped densities effi- 
ciently (jFord 1120061 ; iGregorv 11201 ll ) . For the density such as the left side of Figure [3 single variable 
Gibbs sampler updates and isotropic Metropolis samplers require a small step size to achieve a 
reasonable acceptance probability. But an affine transformation can turn the highly elliptical den- 
sity on the left to something more spherical, as on the right. The affine invariant MCMC sampler 
does not view the ill-shaped density on the left more difficult than the well-shaped density on 
the right (jGoodman etadl2010l ). This makes it efficient on many ill-shaped densities, without cus- 
tomization. 



Ensemble Sampling We use an ensemble sampler that updates L samples together. An ensem- 
ble, E, consists of L samples, or walkers, 9^. The walkers are samples of the posterior parameter 
distribution, 9^ G The ensemble E = (9\,..., 9£) of L walkers is a point in The target 

probability density for the ensemble is 

R{E)=p(9[\D) P {9" 2 \D)---p(9l\D). (11) 

That is to say that each walker is an independent sampler of the posterior. 



Stretch Move In the ensemble MCMC sampler, one step of the Markov chain E(t) — > E(t + 1) 
consists of updating all walkers one by one as a whole cycle. Expressed in pseudo code, it is 



for k = 1 , . . . , L 

update 9 k {t) -> 9 k (t + 1) 
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The update step preserves the ensemble sampling density (jlip . which means that if 
E[k-i]V) = + • + 1),4(*), • • • 

has density (fTTI) . then has density (fTTj) too. The update of walker uses the current 

positions of the other walkers in the ensemble, which we call the complementary ensemble: 

F [k ](t) = {ei(t + 1), . . .,&_!(* + l)A+i(t), ...,&(*)} . 

This is E with its most current locations, but with walker k omitted. 

In this work, we update 9 k using the stretch move, defined as follows. The random stretch 
variable, Z, has density g(z) described below. First, choose another walker Oj G with all 

walkers in Frw(t) being equally likely. Then propose a move k — > ip = Z(0 k — Oj). The proposed 
new position of k lies on the line containing Ok and Oj, but the distance between them is stretched 
by Z. Accept if) with Metropolis probability 



mwi,^- 1 :^:zl > , (12) 



p(4(*)P)J ' 



where M is the dimension of the parameter space. If tp is accepted, then Q k (t + 1) = tp. Otherwise 
k (t + l) = k (t). 

The density of the stretch factor, g(z), must satisfy the symmetry condition 

g(-)=zg(z). (13) 



As long a s (fl~3l) is satisfied, the move E[ k -i](t) — > ^[fe](t) satisfies detailed balance for the density 
pip , see ([Goodman qZ.I|2010| ) . A simple distribution that satisfies this condition is 



g{z) = { C ^ Z£ (14) 
I 0, otherwise. 

The two parameters in the stretch move ensemble sampler are the ensemble size, L, and the stretch 
factor parameter, o. The normalization constant is C = | (y/a — . The method requires L > M 
in order to sample the entire parameter space. In the runs reported here, we took L = 1000 and 
a = 2. The results are insensitive to ensemble size. The parameters are not tuned for individual 
star datasets. 



In pseudo code, moving 0(t) —> 0{t + 1) by one stretch move is given by ([Goodman aZ.ll2010l ) 



for k = 1, . . . , L 

choose j G {1, . . . , L} randomly, with j ^ k 
generate tp = Oj + Z0 k - Oj), Z satisfying (fI31) 
accept, set 9 k = ip, with probability (fT2jl 
otherwise reject, leave 9 k unchanged. 
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4. Local Optima 

Local optima of the negative log likelihood function ([8]) are a problem for many parameter 
estimation and sampling problems. Local optima correspond to parameter fits that are better than 
other nearby fits but worse than fits in different parts of parameter space. Figure [T] gives an example 
for one specific star dataset. While any valid MCMC sampler would eventually find samples near 
the global optimum, doing so may require an unpractically long run. With the initialization just 
described, after cooling (see below), walkers from the ensemble find themselves in various local 
probability wells. The computational challenge is twofold. We seek to explore parameter space 
well enough that at least some of the walkers find the well of the global optimum. We also seek to 
cluster walkers in the various wells so that we can remove walkers from all but the most important 
one, or ones. 

It is clearly undesirable to manually initialize the MCMC code star-by-star to avoid local 
optima. To be most useful, a code should find samples near the global optimum by itself. There 
seems to be no practical way to guarantee this, but our combination of ensemble sampling with 
simulated annealing and clustering does it more reliably than our earlier codes. Also, the faster 
equilibration time of the ensemble sampler allows it to move more easily from local well to local 
well. For HIP36616, we believe our algorithm has identified the global optimum. However, in other 
datasets, our algorithm failed to find fits as good as those other groups found by hand using the 
periodogram. Further progress on this issue would be very important in creating automatic data 
analysis routines that do not require individual human attention for each star. For instance, right 
now, we only use the periodogram to roughly guess the period (see below), but we could make 
further use of the periodogram in our code to improve automatization. 



Simulated Annealing Simulated annealing ( Kirkpatrick etaLlll983l ) is a Monte Carlo method 
for finding global optima in problems that have many local optima. It uses a modified posterior 
distribution with an artificial "inverse temperature" parameter, (3: 



pp0\D) oc p($)e- mg) , 



(15) 



where I is the negative log likelihood function ([5|. Small f3 allows the sampler to explore parameter 
space freely without getting stuck in local wells. The desired posterior ([7]) corresponds to j3 = 1. 
The autocovariance function, described below, decays faster at small /3. Cooling (increasing /3) 
slowly allows the walkers to find promising local optima. A cooling schedule is an algorithm that 
increases /3 during the sampling process. 

We use simulated annealing in the first phase of our sampler. We initialize the ensemble as 
a small ball centered at a random position in pa rameter space except for the dimen sions related 



to the period, for which we use the periodogram (jBretthorst 



1988 



Cumming 1120041 ) to check for 



peaks between the shortest observation interval and the total observation time to make a guess. As 
the dataset can reveal periods outside this range, this is not completely reliable. In a one-planet 
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fit, we use the period with the highest peak Phighest as the initial period guess unless the data 
shows a linear trend which indicates a long-period companion. In such cases we use Phighest /100 
as the period guess. In two-planet fit, Phighest is the initial guess for the 1st planet and for the 2nd 
planet, we use the period of the 2nd highest peak to initialize a portion of walkers and the period 
of the 3rd highest peak for another portion of walkers and so on. If the data shows a linear trend, 
we would use Phighest /100 as the guess for the 2nd planet. The number 100 is chosen empirically 
and any reasonably large number should do. Ideally, we want our initialization to be as random 
as possible to demonstrate the value of the algorithm, but not as random so that the algorithm 
is unnecessarily sabotaged by the randomness. We initialize [3 by j3 m i n = 1/N, where N is the 
size of dataset so that only one data matters on average in the beginning. The runs here used 
the following cooling schedule: (3 is increased in TV" — 1 steps and the amount is determined by 
Iffy — 1/A+l = VO^Aran) where % is the index of the steps. (3 is kept constant within every one 
A^th of the total steps. In general, we used N = 10. At the end of the cooling phase the sampler 
has been using (3 = 1 for a while, so the walkers in the ensemble are well equilibrated in their local 
optima. A typical example is the left frame of Figure [21 with several local optima clearly visible. 

Clustering Figure Q] shows the results for a particular dataset after annealing, for which about 
85% of the walkers in the ensemble are in a well that corresponds to an excellent fit. About 12% 
of the remaining walkers are in another local well that corresponds to a worse fit (see inserts). The 
negative log likelihood for this worse fit is an order of magnitude larger. The clustering phase of 
our algorithm seeks to identify clusters of walkers corresponding to the local wells, so that all but 
the important ones can be removed. The right frame of Figure [T] shows the result in this case. Only 
walkers from the global optimum remain. 

While there are many sophisticated clustering methods, we found that for our problems, a 
simple one dimensional clustering method is more effective and reliable than the others we tried, 
such as clustering in parameter space. Note that there could also be multiple local optima due to 
aliasing, they could have very similar likelihoods. In such cases, we would need more sophisticated 
clustering methods. After the annealing is finished, but before clustering, we collect the average 
negative log likelihood function for each walker. (One can also use posterior function, since the 
normalization constant doesn't matter when one is taking the difference of posterior.) 

This results in the L numbers 

1 T 

l k = -J2l(0k(t)\D). (16) 
t=i 

The idea is that Ij. is characteristic of the well walker 6^ is in, so that walkers in the same 
well will have similar Given the clustering first ranks all the walkers based on so that 
{8(i), 0(2), • • • , G(l) } is i n the order of decreasing l( k ), or increasing — log/^). There are big jumps 
in — log in this sequence (see Figure [1]) , which are fairly easy to identify and thus to separate 
the jumps. We calculated the difference in — log7(fc) for every adjacent pair of 6^) starting from 



- 9 - 



k = 1 and find the 1st pair whose difference is certain amount of times bigger than the average 
difference before. So if 

— log If + log !f 11 

- log Z (J+1) + log > Const 1 W , (17) 

all the with k > j are thrown away and only the ones with k < j are kept. 

A more rigorous approach to the multiple well/multiple fit problem would be to estimate the 
evidence integral corresponding to a well before deleting it. It could happen that a small but deep 
well has less posterior probability than a broad but shallow well. We do not find that this occurs 
in practice in the fits discussed in this paper. For example, our prior prevents very short-period 
orbits that might give excellent but spurious fits to the data. The inserts of Figure [1] suggest that 
the local wells should be removed in our problem. In future work we hope to address this issue 
more carefully by estimating the evidence integrals. Even then, clustering as we do it here will be 
the first step. 



Data and results 



Hek ker eta/.l 1200 6. 2008: iQuirrenbach eta/J 12011 



We tested our code on data from the Lick K -Giant Search (jFrink 1120021 : Mitchell et al\ 12003 ; 

In this paper, we present the results for two 



stars: HIP36616 and HIP88048. HIP36616 is interesting because the data shows a companion with 
a small period and small mass and another companion with very large period and very large mass. 
The information on the large period is incomplete which causes the Metropolis-Hastings algorithm 
difficulty in finding a good fit. HIP88048 has near complete information on both periods so it is an 
easier case for Metropolis Hastings and is used as a comparison. Other groups ran these two stars 
with common Metropolis-Hastings routines. They confirmed that they could easily find a good fit 
for HIP88048. But for HIP36616, a good fit was found only after new data were obtained. Our 
code successfully found a fit for HIP36616 before the new data came in. 

HIP36616 is interesting and challenging both from a physical and a sampling point of view. 
The RV data together with a good fit are shown in Figure The data are well explained by a 
small close companion in a roughly circular orbit and a larger and more distant companion in a 
highly elliptical orbit. The mass ratio in the posterior distribution (assuming coplanar orbits) is 



■m p2 



A 2 \/T 



72, 1/3 



rripi Ai 



< , 1 / 3 



69.8 ±1.7. 



The parameters with their confidence intervals are listed in Table Histograms of the individual 
parameters are shown in Figured! 

The second example fit with the new sampler, HIP88048, is shown in Figure The parameters 
with their confidence intervals are listed in Table [3) We also run our code on only the first half 
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of the data of HIP88048. We get similar results as with all the data, but with a larger variance, 
especially for the parameters of the planet with larger period. 



6. Comparison with Metropolis-Hastings 



All non-trivial MCMC samplers produce autocorrelated samples. An important measure of 
the effectiveness of the sampler is its covariance C(t) of the equilibrium time lag t. The longer it 
takes the covariance C(t) to decay to 0, the longer it takes the sampler to generate independent 
samples, because non-zero covariance indicates correlation between samples. To be more precise, 
suppose V{6) is some function of the parameters, such as V{6) = 9j or some nonlinear function of 
the components. The equilibrium autocovariance function is 



Cy{t) = lim cov 
to— >°o 



V(6(t + t)),V(9(t )) 



The limit to — ► °° is only to ensure that the Monte Carlo has reached steady state. The dimen- 
sionless version of this is the autocorrelation function 



Pv(t) 



C v (t) 



cov 



lim 



V(9(to + t)),V(0(to)) 



var 



V(0(t ) 



Cy(0) *o->-oo 
We used the standard estimators 

N—t 

and 



(18) 



n=0 



Pvit) 



C v (t) 



CV(0) 

Figure [6] displays the functions Pj(t), which correspond to taking V(9) to be the j-th parameter 6j. 
Note Pj(t) can be quite different for different components of 6 for the same dataset. 

The important measure of correlation for Monte Carlo is the integrated autocorrelation time 
(summed, actually) 

oo oo 

T= £ p(i) = l + 2j>(t). 

t=— oo t=l 

This is the numbe r of Monte Carlo step s needed to produce an effectively independent sample of 
the posterior, see (jGoodman al.ll2010l ) and references there. 



Metropolis Sampler Proposal and Tuning. For comparison purposes we coded a traditional 
MCMC sampler that updates the components 0j one at a time using a trial that is uniform in an 
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interval [Oj—rj,0j+rj]. The ranges rj were tuned individually for each component to give acceptance 
probability .4, which is commonly thought to be roughly optimal. The tuning parameters are 
different for each j, and for a given j they are different for each dataset. One can also use Gaussian 
distribution for trial move which should give similar result. 

It is noteworthy that for the Metropolis sampler to work best, one should tune all the free 
parameters which are on the order of the dimension squared. But how the free parameters are 
tuned highly depends on where the walker is in the parameter space. That is to say, the free 
parameters best tuned for a place far from the best-fit optimum in the parameter space could be 
quite different from those best tuned for a place close to the best-fit optimum. Even if the best- 
tuned free parameters are similar everywhere in the parameter space, the tuning workload still 
increases as fast as the parameter space dimension squared. 



Comparison. For the ensemble sampler, we used the ensemble mean 

1 L 

V(9 n (t)) = j-Y,V(e n (t)) (19) 

71=1 

in place of a single value on the right side of (|18p in the definition of p, in other word, replacing 
V in (|18p with V defined in (|19p (Note the possible confusion of notation: 9 n is component n of 
the parameter vector 9, while 9 n is the n— th parameter vector in an ensemble of L such parameter 
vectors.) It may seem unfair to compare one step of a single vector method to a single step of 
the ensemble method, given that an ensemble step requires updating L walkers rather than one. 
However, because the walkers in the ensemble are independent (see ( Hip ), doing r ensemble updates 
produces L effectively independent samples. Therefore, the ensemble sa mpler will be exactly as 



effective as the traditional one if it has the same autocorrelation function (IGoodman et a/.ll2010l ) . 



Two other factors should be mentioned. One is that the traditional sampler requires one like- 
lihood evaluation per component per update, while the ensemble sampler requires one likelihood 
evaluation per vector update. That is, the traditional sampler uses a factor of M (the number of 
parameters) more likelihood evaluations per vector update. This is a serious consideration in the 
present application, where likelihood evaluations are the most expensive part of the algorithm. To 
be sure, in many applications (though not ours) it is not so expensive to update the likelihood func- 
tion after changing a single parameter. The other is that the ensemble method is more automatic 
in that it does not require component specific or dataset specific tuning. All the runs reported here 
used parameters a = 2 and L = 1000. 

Figure [6] shows the results for the two datasets, each of which is a star with two companions. In 
both cases, the traditional sampler relaxes the components of the easier companion more quickly, 
while the ensemble sampler (which updates all parameters together) relaxes all parameters at 



This does not apply to burn in, so that ensemble may suffer more work from burn in phase. 
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roughly the same rate. In both cases, the ensemble is much more effective in relaxing the parameters 
of the difficult companion. In the harder case, HIP36616, the relaxation for the ensemble sampler 
is at least an order of magnitude faster than the relaxation for the traditional sampler. 



7. Discussion 



This paper reports progress toward the goal of making posterior sampling fast, reliable, and 
automatic for exoplanet radial velocity fitting. The sampler presented here performed well and 
automatically for all the individual star datasets we tried (about a hundred data points). Very 
high aspect ratio wells and spurious local wells all were handled without individual tuning using 
a single shell scri pt. Future projects, such as hierarchical modeling of the exoplanet distribution 
(jHogg e£adl2010l ). can become impractical with traditional samplers if they are slow . Statistical 



information about exoplan ets, such as the eccentri city distribution (IShen etal. 



2010 



Zakamska etal 



distr ibution (jSchlaufman et al\ 12009) . the brown dwarf desert (jGrether e£ a/.l 120061 : iLeconte etal. 



2008 



201C), the ma ss distribution ( Howard etal. 20101) . the 



Hogg etal. 



mass-semimajor axis 



20101 ) . and inclinations (IHo e£ al.ll2010l ). depend on rapid and reliable processing of star datasets. 
It is a limitation of this work that we have used uninformative priors, since we already have plenty 
of knowledge about exoplanets; we should infer from the collection of all exoplanets priors that 
improve our fitting for any individual new system. Hierarchical approaches make it possible to 
infer informative priors without making strong new assumptions. Hierarchical model comparison 
and selection is a future project, as are projects in which we tune our optimization strategy to 
operate rapidly and robustly on large numbers of exoplanet systems (a pre-requisite for efficient 
hierarchical modeling. 

We believe that Figure [7] at least partly explains the faster decay of correlations in the ensemble 
sampler. For the probability distribution in the left frame, traditional single variable updates must 
have small proposal steps or suffer high rejection rates. For example, if <j>\ is fixed, then w\ cannot 
move much and stay within the range of likely parameters. Samplers based on isotropic multivariate 
proposals must also take small proposal steps. But the ensemble sampler moves walkers along lines 
between themselves and other walkers, which naturally adapts the proposal steps to the geometry 
of the distribution. 

In our comparison with the Metropolis-Hastings sampler, we didn't tune the M-H sampler to 
its full extent. If we did, it is reasonable to think that the performance of the Metropolis-Hastings 
sampler — locally where it is tuned — would be the same as our ensemble sampler. But tuning free 
parameters (whose number is on the order of the number of dimensions squared) is already difficult, 
let alone the fact that best-tuned free parameters could change dramatically as the walker travels 
in the parameter space. Our ensemble sampler emulates a best-tuned Metropolis-Hastings sampler, 
best-tuned wherever the walker goes. 



One of the problems we have encountered is that our ensemble sampler is greatly affected by 
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the existence multiple local optima. If the walkers occupy different local optima, the acceptance 
ratio can be greatly lowered because the proposed moves are based on walkers from other optima. 
In this case, the proposal may be outside any well and therefore unlikely to be accepted. The 
clustering we propose and use here is a great help in this regard, and effectively overcomes this 
problem. And our runs never encountered more than one statistically significant local optima. 

The ensemble sampler may also be better suited for high performance computing. Each walker 
in the ensemble evolves in parallel. Our current implementations do not exploit this parallelism; 
more sophisticated software designed for high performance hardware would be welcome. 

We believe that more sophisticated methods from machine learning could have a large impact 
on these sampling problems. One obvious application would be more sophisticated clustering 
methods. Another would be to make more use of the information in the ensemble to build a 



model of the posterior. For example ([Goodman etal\ I201CI ) speculate that one could use half of 



the ensemble to build a nonlinear model of the posterior that would allow the other half of the 
ensemble to be re-sampled more effectively 

All of the software used for this project is available upon request. Included in this is a general 
purpose of the ensemble sampler that can be used for other sampling problems. 
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Parameter 


Variable 


Prior 


Amplitude 


A(ms _1 ) 


Jefferys 


Period 


P(s) = 2vr/w 


Jefferys 


Orbital Phase 


4> (rad) 


Uniform 


Eccentricity 


e 


Uniform 


Longitude of Periastron 


w (rad) 


Uniform 


Jitter 


s (ins -1 ) 


Jefferys 



Prior Mathematical Form 



Reference Velocity 



(A+Aq)- 1 



A + A r . 



*0 
P- 1 



log {Pmax I P-min) 

< 4> < 2vr 
< e < 1 
< w < 2vr 

(s+sp)- 1 

ln ( sp + smax ) 
V s ) 

v (ms _1 ) Uniform v 0min < v < v 0max 



Table 1: Priors for the model parameters. The hyperparameters used in code are = 10 ms , 
s = 10ms" 1 , vomin = -10000ms -1 and v 0m ax = 10000ms -1 . A max , P max and s max are in the 
normalization, so they are not actually used. 



- 16 - 



parameter 


median 


68% CI 


95% CI 


ii(ms 4 ) 


133.9640 


±3.1450 


±6.3495 


wi(radd _1 ) 


2.10296 x 10~ 2 


±0.00259 x 10- 2 


±0.00507 x 10~ 2 


0i(rad) 


5.26268 


±0.52774 


±1.24894 


ei 


0.0557283 


±0.0234535 


±0.0452732 


roi(rad) 


3.57290 


±0.51762 


±1.24018 


A 2 (ms' 1 ) 


4015.00 


±7.16 


±14.22 


W2(rad d _1 ) 


5.23189 x 10~ 4 


±0.17656 x 10~ 4 


±0.35001 x 10~ 4 


02<rad) 


4.57002 


±0.05873 


±0.11665 


&2 


0.734762 


±0.005779 


±0.011469 


tU2(rad) 


4.17465 


±0.00629 


±0.01270 


^(ms- 1 ) 


18.4083 


±1.7982 


±3.6052 


Uo(ms _1 ) 


-249.559 


±14.560 


±28.969 



Table 2: HIP36616, posterior means and confidence intervals for the 12 parameters. Note one nearly 
circular orbit (ei) and one very eccentric one (e2). 
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parameter 


median 


68% CI 


95% CI 


ii(ms _1 ) 


288.108 


±1.261 


±2.464 


Wi(rad d _1 ) 


1.18567 x 1(T 2 


±0.00048 x 10~ 2 


±0.00096 x lO" 2 


0i(rad) 


4.12983 


±0.03176 


±0.06322 


ei 


0.129846 


±0.004529 


±0.008910 


roi(rad) 


1.73223 


±0.03175 


±0.06283 


^(ms" 1 ) 


175.842 


±1.588 


±3.143 


W2(rad d _1 ) 


1.95700 x 10~ 3 


±0.02129 x 10~ 3 


±0.04227 x 10~ 3 


2 (rad) 


3.85943 


±0.04639 


±0.09299 


e2 


0.194608 


±0.012026 


±0.024027 


z^frad) 


1.76762 


±0.03943 


±0.07846 


^(ms- 1 ) 


7.7662 


±0.7582 


±1.5029 


t>o(ms _1 ) 


-39.1676 


±1.4873 


±2.9545 



Table 3: Parameters for HIP88048 
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CDF of -log(likelihood) of all the walkers 




-log(likelihood) 



Fig. 1. — CDF of (|16p . looking for n = 2 companions in the RV data for (HIP36616). There are 
M = 2 ■ 5 + 2 parameters. The ensemble size was 1000 and the Ik were computed by averaging 
over T = 100 samples. The inserts illustrate the fits corresponding to the best-fit optimum and 
two local optima. 
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Fig. 2. — Two dimensional scatterplots of the walkers in the size L = 1000 ensemble of Figure [TJ 
Plotted are the angular velocity parameters for the two companions. The left frame contains all 
the walkers in the ensemble after annealing. The right frame contains only those walkers that the 
code identified as being in the deepest well, corresponding to the best fit. 
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Fig. 3.- 



The fit of HIP36616. See also Figure Q]for worse fits that sit in deep local wells. 




Fig. 4. — The histograms of the parameter posteriors from HIP36616. 



-22- 




5.— The fit of HIP88048 



-23- 



M-H Sampler - HIP88048 M-H Sampler - HIP3661 6 




200 400 600 800 1000 200 400 600 800 1000 
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Fig. 6. — After convergence is achieved, we take chains of all the parameters to calculate the auto- 
correlation function which shows how many steps it takes to generate an independent sample. The 
autocorrelation functions for the traditional Metropolis-Hastings sampler are the 1st row and the 
2nd row is for the Affine Invariant Ensemble Sampler. Data are taken after the cooling and clus- 
tering are completed. All of the runs used 10 7 resamplings. The left two frames are for HIP88048, 
which is relatively unambiguous and easier to fit. The two right frames are for HIP36616, which 
is harder. Both sampler uesd the same parametrization described in the paper. For Metropolis- 
Hastings sampler, we tuned all the parameters so that the acceptance rate is around 0.4 in all the 
dimensions. We did not explore all the non- linear transformations of the parameters nor all the 
proposed PDFs, but this is what most people do. For our ensemble sampler, we also tuned the 
acceptance rate to be around 0.4. 




Fig. 7. — The relationship between cj) and w of both companions of HIP36616. We used Gaussian 
kernel density estimation to estimate the joint probability distribution of <p\ and w\ (left) and 4>2 
and W2 (right) from a large number of samples. 



